Disclose Models, Hide the Data - How to Make Use of Confidential Corpora without Seeing Sensitive Raw Data
Authors
Abstract
Confidential corpora from the medical, enterprise, security or intelligence domains often contain sensitive raw data, which leads to severe restrictions on the public accessibility and distribution of such language resources. The enforcement of strict data protection mechanisms constitutes a serious barrier to progress in language technology (products) in such domains, since these data are extremely rare or even unavailable to scientists and developers not directly involved in creating and maintaining such resources. To bypass this problem, we propose to distribute trained language models derived from such resources as a substitute for the original confidential raw data, which remain hidden from the outside world. As an example, we exploit the access-protected German-language medical FRAMED corpus, from which we generate and distribute models for sentence splitting, tokenization and POS tagging based on software taken from OPENNLP, NLTK and JCORE, our own UIMA-based text analytics pipeline.
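The workflow proposed in the abstract (train on the protected corpus, release only the model) can be illustrated with a minimal, self-contained sketch. It is not the authors' method and does not use the OPENNLP, NLTK or JCORE model formats: the toy suffix-based tagger, the two example sentences, the STTS-style tags and all function names are assumptions made purely for illustration.

```python
import json
from collections import Counter, defaultdict

# Hypothetical stand-in for confidential tagged sentences.
# In the proposed workflow this data never leaves the data holder.
CONFIDENTIAL_CORPUS = [
    [("Der", "ART"), ("Patient", "NN"), ("klagt", "VVFIN"),
     ("über", "APPR"), ("Schmerzen", "NN")],
    [("Die", "ART"), ("Patientin", "NN"), ("klagt", "VVFIN"),
     ("über", "APPR"), ("Husten", "NN")],
]


def suffix(token, n=3):
    """Return the lowercased last n characters of a token."""
    return token.lower()[-n:]


def train(corpus):
    """Build a toy model that maps each suffix to its most frequent tag."""
    counts = defaultdict(Counter)
    overall = Counter()
    for sentence in corpus:
        for token, tag in sentence:
            counts[suffix(token)][tag] += 1
            overall[tag] += 1
    return {
        "default": overall.most_common(1)[0][0],  # fallback for unseen suffixes
        "suffix_to_tag": {s: c.most_common(1)[0][0] for s, c in counts.items()},
    }


def tag(model, tokens):
    """Tag tokens using only the released model, without the raw corpus."""
    table = model["suffix_to_tag"]
    return [(t, table.get(suffix(t), model["default"])) for t in tokens]


# Data holder side: train on the protected corpus, serialize only the model.
released = json.dumps(train(CONFIDENTIAL_CORPUS), ensure_ascii=False)

# Consumer side: load the released model and tag new text.
model = json.loads(released)
result = tag(model, ["Die", "Patientin", "hat", "Schmerzen"])
print(result)
```

The consumer only ever handles the serialized `released` string, which in this toy case holds aggregated suffix statistics rather than the original sentences.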
Similar resources
Hiding Sensitive Association Rules without Altering the Support of Sensitive Item(s)
Association rule mining is an important data-mining technique that finds interesting association among a large set of data items. Since it may disclose patterns and various kinds of sensitive knowledge that are difficult to find otherwise, it may pose a threat to the privacy of discovered confidential information. Such information is to be protected against unauthorized access. Many strategies ...
Introducing an algorithm for use to hide sensitive association rules through perturb technique
Due to the rapid growth of data mining technology, obtaining private data on users through this technology becomes easier. Association Rules Mining is one of the data mining techniques to extract useful patterns in the form of association rules. One of the main problems in applying this technique on databases is the disclosure of sensitive data by endangering security and privacy. Hiding the as...
Data Envelopment Analysis with LINGO Modeling for Technical Educational Group of an Organization
Data Envelopment Analysis (DEA) was developed to help compare the relative performance of decision-making units. It is a non-parametric method for performing frontier analysis. It uses linear programming to estimate the efficiency of multiple decision-making units and it is commonly used in production, management and economics [3]. DEA generates an efficiency score between 0 and 1 for each unit...
A new privacy-preserving data publishing method aimed at improving classification accuracy on anonymized data
Data collection and storage have been facilitated by the growth in electronic services, which has led to vast amounts of personal information being recorded in the databases of public and private organizations. These records often include sensitive personal information (such as income and diseases) and must be protected from access by others. But in some cases, mining the data and extraction of knowledge from thes...
P-9: Investigate The Effects of Polyphenol Gossypol as A Male Antifertility Agents,How to Make Male Contraceptive?
Background: Gossypol is a polyphenol isolated from the cotton plant (Gossypium sp.). The substance, which shows promise for use as a male contraceptive, is a derivative of cottonseed oil. Various species of animals have been tested and some are more sensitive than others, and in men it causes spermatogenesis arrest at relatively low doses. Materials and Methods: This routine database study re-examin...